We would like to thank Merv for spending a bountiful amount of time answering questions, suggesting new tools to try, and discussing results every step of the way through the project.
Type II diabetes (T2D) occurs when the body fails to produce enough insulin and/or when insulin fails to work properly (insulin resistance). The life expectancy of people with T2D is, on average, reduced up to 10 years.\(^1\) In 2017, an estimated 462 million people worldwide were affected by T2D, accounting for about 6.3% of the world’s population.\(^2\) This prevalence of T2D is only rising, and we must explore the genetic mechanisms to understand the development of the disease.
The β-cell is one of four major types of cells present in the islets of Langerhans, which are islands of cells distributed throughout the endocrine pancreas in most mammals. The β-cell synthesizes and secretes the hormone insulin mainly in response to glucose but also in response to several nutrients, hormones and nervous stimuli.\(^5\) During insulin resistance and pre-diabetes, islet β-cells typically increase in number via proliferation of the existing cellular population. In addition, insulin output is enhanced prior to diabetes onset to compensate for reduced action of the hormone in peripheral tissues (e.g., skeletal muscle and liver). Once a decrease in islet β-cell insulin secretion and diminutions in the β-cell population or both occur, clinical onset of T2D ensues. Progression to overt diabetes requires a diminution in pancreatic islet β-cell mass, β-cell function, or both. Pancreatic islet β-cells from human subjects with T2D display markers of dedifferentiation, the process in which a cell changes from one cell type to another. For example, increased abundance of Aldh1a3 and reduced nuclear presence of the transcription factor Nkx6.1 were observed in pancreatic islets from T2D, but not in healthy controls.\(^4\) The main features associated with dedifferentiation are a reduction in some transcription factors and insulin response.\(^6\)
There are multiple rodent models for T2D, including monogenic and polygenic obese models, induced obesity models, and non-obese models. In our study, we compare the islets of wild-type (WT), heterozygous (db/+), and homozygous for diabetes (db/db) mice to detect the various genes that are impacted by dedifferentiation. Wild-type mice mark individuals of the population representing the species as it occurs in nature, containing the normal alleles, not mutations. The heterozygous samples are characterized as a pre-diabetic model, resulting from having one mutated allele. The diabetic mice contain an autosomal recessive mutation in the leptin receptor, resulting in the traits of hyperphagic, obese, hyperinsulinaemic and hyperglycaemic.\(^7\) Using RNA-seq, it is possible to compare the gene expression of the samples, possibly revealing important differences between diseased and non-diseased samples that provide insight on diabetes development for clinical application.
As of 2017, the db/db model was the most widely used mouse model of type 2 diabetes mellitus (T2DM).\(^3\) However, researchers have yet to fully characterize the process by which islets dedifferentiate in db/db mice. The pharmacotherapies that have been developed thus far cannot regain the functional loss of pancreatic β-cells. The purpose of this research was to establish db/db as a model of dedifferentiation, identify the various genes that are impacted by dedifferentiation and to determine the gene network that is directly involved in the dedifferentiation of db/db islets.\(^6\)
The publication serving as a guide for this project is found at https://link.springer.com/article/10.1007%2Fs12022-018-9523-x and has the title “RNA-Seq Analysis of Islets to Characterise the Dedifferentiation in Type 2 Diabetes Model Mice db/db”. The data was generated by Abraham Neelankal John, Ramesh Ram, and Fang-Xu Jiang. There are nine total samples involved in this study, which can be found using the GEO accession number GSE107489. These samples were split into three groups. The experimental condition was 12-week-old mice with a random blood glucose of 7-8 mM, 9-10 mM and 23-28 mM blood glucose within wild-type, db/db, and db/+ respectively. The islets of Langerhans of the mice were purified and separated from the pancreas using centrifugation, and RNA extraction of the islets followed.
The Illumina Hi-seq 2500 platform was used for RNA-seq, and 50 BP reads were generated for the islets collected from all nine mice. The data yield per individual mice islet ranged from 1.02 to 1.23 Gb. The library for RNA-seq was prepared using Illumina’s Ribo Zero Gold protocol. The library strategy is RNA-seq, with source transcriptomic, and selection cDNA. The Illumina bcl2fastq 2.20.0.422 pipeline was used to generate primary sequence read sequence with Illumina quality scores (phred-like quality +33). All the samples were checked for cross-species contamination.
To begin the analysis, we downloaded the fastq files to the SCU. We created a new folder in the SCU by first navigating to /home/nib4003/ANGSD_2021_hw/ and executed the command mkdir final_project. Here we stored all our files created on the SCU for the final project. We then began our data downloads by navigating to https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107489 where we found the 9 samples of mice to be sequenced. Under Supplementary file we clicked ‘SRA Run Selector’. On this page, we first clicked the blue checkmark box to highlight all runs, and under Download we clicked ‘Accession List’, which downloaded a file with all accession numbers for the files. Using WinSCP, we then transferred this file to the final_project folder in the SCU. Next, we used the SRA toolkit, following the documentation here: https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/, which told us how to download the files to the SCU. This required the SRA toolkit, which we found in the SCU using the command spack find. We loaded the SRA toolkit into the SCU using the command spack load sra-toolkit@2.10.7%gcc@6.3.0. Once this command was run, we used the SRA toolkit to get the fastq files, beginning with the following code.
prefetch --option-file SRR_Acc_List.txt
Prefetch is a program which downloads Runs (sequence files in the compressed SRA format) and all additional data necessary to convert the Run from the SRA format to a more commonly used format, in our case fastq. The --option-file part of the command allows us to pass our list of accession numbers instead of one accession number at a time. We next extracted all fastq files using the following code. This resulted in a list of fastq files named accession_number.fastq.
fastq-dump SRR6329219 SRR6329221 SRR6329224 SRR6329225 SRR6329226 SRR6329227 SRR6329223 SRR6329220 SRR6329222
We immediately renamed the files so that they are labeled in a manner which we know which phenotype the mouse had. We show an example for a mouse who is heterozygous, which we name heterozygous_sample1.fastq. The naming convention is diabetes for db_db (-/-), heterozygous for (+/-), and wt for wild type (+/+). After renaming all 9 samples, we compressed all files using gzip.
mv SRR6329219.fastq heterozygous_sample1.fastq
We next downloaded the reference genome and the annotations. We navigated to the website https://www.gencodegenes.org/mouse/release_M25.html which contained the most up to date information for the mouse genome used, which is GRCm38.p6 (mm10). There were columns that describe the contents and regions of the files. We took the genome annotations file (.gtf) for the primary “PRI” under Regions for the Comprehensive gene annotation files under Content. To download the reference genome, we used the command wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.primary_assembly.annotation.gtf.gz in the SCU. We then also downloaded the PRI Genome sequence, primary assembly (GRCm38) fasta file using the command wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz. Once we had both files downloaded, we wanted to unzip them for use in genome indexing, which we did using the commands zcat gencode.vM25.primary_assembly.annotation.gtf.gz > gencode.vM25.primary_assembly.annotation.gtf and zcat GRCm38.primary_assembly.genome.fa.gz > GRCm38.primary_assembly.genome.fa. In our final_project directory, we created a new directory to store the output genome index using the command mkdir star_mouse_genome_index.
Before any alignment occurs, we first needed to run an initial qc using fastqc on all our fastq files. Using fastqc allowed us to better understand the reads by providing multiple results we could interpret, such as possible contaminating and over-represented sequences, sequencing quality using the PHRED score, GC distributions, and duplication levels of reads. In the SCU, we used the command spack load fastqc to load this tool, and executed the following command in our final_project directory, which contained all fastq files which ended in fastq.gz. This code ran fastqc on all 9 samples. We have included the fastqc data text files in the fastqc summary files folder on the final project github page.
for file in *fastq.gz; do fastqc $file --extract; done
The next step in the pipeline was running multiqc on our files output from fastqc, for the purpose of better visual comparison of the results of the qc for the initial reads of each sample. To do this, we first imported multiqc using spack load -r py-multiqc and then we used the command below to generate the final report.
mutliqc /home/nib4003/ANGSD_2021_hw/final_project/
We then exported the results of multiqc to the computer using WinSCP. By inspecting the multiqc, we saw that there are potential problems with the GC content and overrepresented sequences of diabetes sample 3 reads, as well as the sequence duplication levels of all three diabetes samples. The results for the GC content are shown below.
Figure 1. Result of multiqc GC content showing error in diabetes sample 3 output.
The red line in the per sequence GC content plot represents diabetes sample 3. We notice that the distribution of GC content for this sample contains a large peak to the right of the center of the distribution of all other samples. One possible explanation for this could have been that diabetes sample 3 was run for more PCR cycles than the other samples, resulting in the duplication of more sequences and higher bias for GC content. Below we show the output for the multiqc overrepresented sequences.
Figure 2. Result of multiqc overrepresented sequences showing error in diabetes sample 3 output.
This plot shows that diabetes sample 3 has a much greater prevalence of overrepresented sequences than the rest of the samples. This reinforces the idea that this sample could have been run for an increased number of PCR cycles compared with the other samples.
Figure 3. Result of multiqc sequence duplication levels showing error in all 3 diabetes samples.
The sequence duplication levels plot displays that there are three samples, the diabetes samples, which contain a large number of duplicated sequences in the library. Therefore, these could effect the results of downstream analysis due to the bias from the abundance of certain reads having a larger impact than others.
To further understand the origin of the overrepresented sequences for our diabetes samples, we searched through the fastqc output for overrepresented sequences in the samples, beginning with diabetes sample 3, which showed the greatest percentage of total overrepresented sequences. Using BLAST, we were able to find out where these overrepresented sequences were mapped to. The most overrepresented sequence from the fastqc output, the total number of times it was counted, and the percentage of its makeup of the total number of reads for diabetes sample 3 is shown below.
| Sequence | Count | Percentage |
|---|---|---|
| CCTTAGGCAACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATG | 269364 | 1.197866 |
When searching the overrepresented sequences of the other diabetes samples, we found this sequence is the top overrepresented sequence for all 3 diabetes samples. Also, this sequence was within the top 3 overrepresented sequences for all samples in the study. The overrepresented sequences can be found by looking at the files in the fastqc summary files folder. However, the percentage for all other files are not as significant as those from diabetes sample 3.
The results of BLAST are shown in the BLAST Search and Results folder and is titled blast_most_overrepresented_sequence_diabetes_sample3_output_Rn7s2_small_cytoplasmic_RNA.pdf. This pdf showed that the sequence belonged to a noncoding small cytoplasmic RNA in the gene Rn7s2. Therefore, we knew if the Rn7s2 gene turned out to be differentially expressed later in the analysis, we would need to show that it is not an artifact of PCR amplification for the expression to be completely valid.
We then created the reference genome and annotations. We first imported STAR using spack load star@2.7.0e and samtools using spack load samtools@1.9%gcc@6.3.0. After discussion with Merv and reading of the STAR manual, we decided to leave many of the parameters for STAR at their default values for the mouse genome. However, we used a runThreadN of 16 to create the genome index more quickly, and set the parameter sjdbOverhang to 49 bp because the reads are 50 bp long. This tells STAR how many bases to concatenate from donor and acceptor sides of the junctions. Since we had 50 bp reads, the ideal value of –sjdbOverhang was 49, which allowed the 50 bp read to map 49 bp on one side, and 1 bp on the other side. To produce our genome index and align our files, we used slurm on the SCU. To make our scripts more efficient, we created a temporary folder in buddy using the command mkdir -p /scratch/nib4003 where -p represents a parent directory. We created the scripts in the /home/nib4003 directory using the command vi star_mouse_genome_index_creation_without_alignments.sh where we provided our code, shown in the github folder Bash Scripts, with title star_mouse_genome_index_creation_without_alignments.sh. To run our script, we exited buddy to curie and ran the command sbatch star_mouse_genome_index_creation_without_alignments.sh. To check if the STAR tool was still progressing, we ran the command squeue -u nib4003 which output a blank line upon completion. The genome indexing script took a long time (9 hours!!!) to run because it is a mammalian genome. The code for genome indexing is shown below as well.
STAR --runMode genomeGenerate --runThreadN 16 --genomeDir /home/nib4003/ANGSD_2021_hw/final_project/star_mouse_genome_index --genomeFastaFiles /home/nib4003/ANGSD_2021_hw/final_project/GRCm38.primary_assembly.genome.fa --sjdbGTFfile /home/nib4003/ANGSD_2021_hw/final_project/gencode.vM25.primary_assembly.annotation.gtf --sjdbOverhang 49 --outTmpDir /scratch/nib4003/STARtmp
Once the genome index was created, we were able to run STAR alignment on the samples. We ran STAR alignment with a runThreadN count of 16 and mostly default values. We returned the alignments in a sorted BAM file, which included all SAM attributes STAR offers using the --outSAMattributes All flag, which are shown in the table below. These attributes provide extra information for the aligned reads, such as the number of hits found, alignment scores, and a source where we can look at mismatches.
| Tag | Meaning |
|---|---|
| NH | Number of reported alignments that contain the query in the current record |
| HI | Query hit index |
| AS | Alignment score generated by aligner |
| nM | Number of mismatches per (paired) alignment, |
| NM | Edit distance to the reference |
| MD | String for mismatching positions |
| jM | Intron motifs for all junctions |
| jI | Start and End of introns for all junctions |
| MC | CIGAR string for mate/next segment |
We also used slurm to align our reads to the reference genome. We created a script using the command vi star_mouse_alignments_only.sh where we stored our alignment code. To run our script, we exited buddy to curie and ran the command sbatch star_mouse_alignments_only.sh. To check if the alignment was still progressing, we ran the command squeue -u nib4003 which output a blank line upon completion. The STAR alignment script for the second diabetes sample is shown on github in the Bash Scripts folder, with title star_mouse_alignments_only.sh. The code for alignment is shown below as well for diabetes sample 2.
STAR --runMode alignReads --runThreadN 16 --genomeDir /home/nib4003/ANGSD_2021_hw/final_project/star_mouse_genome_index --readFilesIn /home/nib4003/ANGSD_2021_hw/final_project/diabetes_sample2.fastq.gz --readFilesCommand zcat --outFileNamePrefix diabetes_sample2_alignments. --outSAMtype BAM SortedByCoordinate --outSAMattributes All --outTmpDir /scratch/nib4003/STARtmp
We aligned each file separately because this process took a lot of computational power and time, and we did not want to take resources from other students for an extended period of time. Therefore, we aligned our files over a period of about a week. To align a different sample, we only had to change the filename being used. For example, to align diabetes sample 2, we changed all occurrences of diabetes_sample1 to diabetes_sample2. Notice we created a temporary directory to keep the STAR files for more efficiency using –outTmpDir /scratch/nib4003/STARtmp. This folder could not be created before running the script so we ensured this before running STAR with the command rm -rf /scratch/nib4003/STARtmp. STAR alignment took about 3 hours per file alignment. We created folders for the STAR alignments titled for each sample. For example, using command mkdir diabetes_sample1_aligned_star_reads in the final_project directory, we stored all our files specific to the diabetes 1 sample.
We next ran bamqc on the output BAM files from STAR alignment to see the status of the alignments. The code for diabetes sample 1 bamqc is shown below. We again ran bamqc one file at a time and easily changed the script for different files by changing the names of the sample being used the 3 times it shows up in the code (e.g. diabetes_sample1 -> diabetes_sample2).
/softlib/apps/EL7/BamQC/bin/bamqc /home/nib4003/ANGSD_2021_hw/final_project/diabetes_sample1_aligned_star_reads/diabetes_sample1_alignments.Aligned.sortedByCoord.out.bam --outdir /home/nib4003/ANGSD_2021_hw/final_project/diabetes_sample1_bamqc/
We exported the HTML to the computer using WinSCP and found that there were errors with the aligned reads for all diabetes samples, namely the mapping quality distribution. These graphs are shown below. The MAPQ score is 3 or less for a large proportion of the reads. A score of 3 means that the read is found to be aligned to 2 different locations, resulting in a low confidence of where it is actually mapped to. A score of 2 mean the read maps to 3 locations in the target. A score of 1 means the read maps to 4-9 locations in the target. A score of 0 means the read maps to 10 or more locations in the target. We exported the HTML summaries to the computer using WinSCP, which can be found on the github page in the folder bamqc files. We discovered that there were errors with the aligned reads for all diabetes samples, namely the mapping quality distribution. These graphs are shown below. Therefore, we noted this could be a potential problem in the downstream analyses. It is possible that the genome inherently has multiple areas where these reads could be aligned to. We decided to continue with the analyses and if we found a lack of significant genes, we would need to address the mapping quality distribution. If necessary, one method to do this could be to use Salmon or Kallisto.
Figure 4. Result of bamqc error in quality mapping distribution for diabetes sample 1.
Figure 5. Result of bamqc error in quality mapping distribution for diabetes sample 2.
Figure 6. Result of bamqc error in quality mapping distribution for diabetes sample 3.
Our next step in the pipeline was to run QoRTs on our aligned reads to provide us with additional quality control results. Before using this tool, we had to find out if our aligned reads were stranded. To accomplish this, we looked at the library preparation ribo zero gold, which is part of the TruSeq Stranded Illumina family. This information showed that the reads must be stranded. Therefore, we had to add a flag specifying this type of read to QoRTs. As another validation step, we also checked if these aligned reads were stranded using IGV. As a result, we needed to first index all aligned .bam files using samtools. We therefore had to run the command spack load samtools in the SCU. We then navigated to each folder containing the aligned STAR bam files and ran the following command using the diabetes sample 1 as an example: samtools index diabetes_sample1_alignments.Aligned.sortedByCoord.out.bam which resulted in an indexed bam file to be used with IGV. In IGV, we verified that the reads are in fact stranded by observing that they point in the same direction on the same gene. This is shown from the IGV output below as all strands are pointed in the same direction for the gene.
Figure 7. Result of IGV showing we have stranded reads for QoRTs analysis.
To use QoRTs, we downloaded the .jar file from github to a newly created folder, made using the command mkdir diabetes_sample1_qoRTs_alignment_qc, by running the command wget http://hartleys.github.io/QoRTs/QoRTs.jar. We then executed the command spack load qorts so that it was downloaded through the SCU. To use QoRTs for the first diabetes sample, we used the following code.
java -Xmx12G -jar QoRTs.jar QC --singleEnded --stranded --generatePdfReport /home/nib4003/ANGSD_2021_hw/final_project/diabetes_sample1_aligned_star_reads/diabetes_sample1_alignments.Aligned.sortedByCoord.out.bam gencode.vM25.primary_assembly.annotation.gtf /home/nib4003/ANGSD_2021_hw/final_project/diabetes_sample1_qoRTs_alignment_qc
QoRTs looks through all the reads in an aligned file and therefore it is interesting to know how many reads each fastq file initially is made up of. To do this for the first diabetes sample, we execute the command zcat diabetes_sample1.fastq.gz | wc -l which we can simply divide by 4 to get the number of reads in the fastq file. We created a table below to show the results for all samples. There are many reads per sample and so running QoRTs for one file takes about 30 minutes.
| Sample | Number of Reads |
|---|---|
| Diabetes 1 | 25,537,622 |
| Diabetes 2 | 23,087,560 |
| Diabetes 3 | 22,486,982 |
| Heterozygous 1 | 23,652,124 |
| Heterozygous 2 | 21,958,674 |
| Heterozygous 3 | 24,500,570 |
| Wild Type 1 | 24,597,444 |
| Wild Type 2 | 23,748,629 |
| Wild Type 3 | 23,233,794 |
We ran the script, titled qorts_stranded.sh in the github folder Bash Scripts, with slurm to generate the results for each aligned bam file. Notice that we had created a folder, for example using the command mkdir diabetes_sample2_qoRTs_alignment_qc, before we ran this script in the final_project directory where we keep all our files, specific to each sample.
From this output, we received files with the same name, which is why we placed them in different folders. Therefore, we executed a command to change the names of all files within each separate folder. The following is an example for diabetes sample 1: for file in *; do mv $file diabetes_sample1_$file; done. This renamed all files in the folder such that they are labeled with diabetes_sample1 to separate them from the other samples used. When this was completed, we moved the .pdf report files to the computer using WinSCP to view them, and uploaded them to the github repository for the final project under the qoRTs Files Generated folder. The QoRTs files revealed that all the samples have very similar results except for in the Read Drop Rate, by Reason plots, the diabetes samples have a much higher rate of dropping reads due to multi-mapping. This was expected due to our previous quality control.
Before running featureCounts, we created a new directory where we needed to add all aligned bam files in the final_project directory using the command mkdir all_bam_files_for_featurecounts. We then navigated to this directory and copied all bam files to the folder one at a time using, for example, the code cp /home/nib4003/ANGSD_2021_hw/final_project/diabetes_sample2_aligned_star_reads/diabetes_sample2_alignments.Aligned.sortedByCoord.out.bam . which resulted in a copy of the diabetes sample 2 aligned bam file populating the new directory. Once this was completed for all bam files, we moved on to running featureCounts.
We ran featureCounts on all our files at once and generated a report that encompassed all of them. To begin, we ran the command spack load subread in the SCU. This was because featureCounts is part of the subread software package. We then made a new directory in the ANGSD_2021_hw folder using the command mkdir counting_reads_for_all_samples and navigated into this directory. We then executed the command featureCounts –help to see all the flags for featureCounts. For our project, we did not include the -O flag which assigns reads to all their overlapping meta-features. We also left the parameter -t at the default, which specified the feature type in GTF annotation as ‘exon’. This parameter counted the number of reads that overlap with every exon. We set the parameter -g to the default value of gene_id because this corresponds to the syntax in our reference genome file. Since we had stranded reads, we set the parameter -s 2 which specified that the reads were reversely stranded. This is an important step because a substantial impact of stranded RNA-seq has been demonstrated to enhance transcriptome profiling and gene expression measurements when compared with non-stranded RNA-seq.\(^9\) We again used slurm to run a script for featureCounts, shown in the Bash Scripts folder on github, titled featurecounts_stranded_after_merv_comments.sh. The code used is shown below as well.
featureCounts -a /home/nib4003/ANGSD_2021_hw/final_project/gencode.vM25.primary_assembly.annotation.gtf -t 'exon' -g 'gene_id' -s 2 -o /home/nib4003/ANGSD_2021_hw/final_project/counting_reads_for_all_samples/final_project_featCounts_mouse_genes_stranded_after_comments_from_merv.txt /home/nib4003/ANGSD_2021_hw/final_project/all_bam_files_for_featurecounts/*.bam
This generated the final counts which we transferred to the computer using WinSCP for further analysis. Advice from Merv led us to run multiqc on the featureCounts summary file final_project_featCounts_mouse_genes_stranded_after_comments_from_merv.txt.summary using the command multiqc final_project_featCounts_mouse_genes_stranded_after_comments_from_merv.txt.summary. This resulted in an additional multiqc file which we named multiqc_report_final_project.html and is found in the SCU location /home/nib4003/ANGSD_2021_hw/final_project/counting_reads_for_all_samples. We transferred this file to the computer using WinSCP and saved the plot shown below. This demonstrated the number of reads that were assigned and used in the final counts in blue. We again noticed that the proportion of multimapped reads was much larger for the diabetes samples than the other samples, reinforcing what we showed from our bamqc quality mapping distributions.
Figure 8. Assignments of reads shown by multiqc of featureCounts summary file.
We now could move on to exploratory analysis and differential gene expression analysis.
In addition to using our counts matrix, we also downloaded the counts matrix from the publication which the author’s of the paper posted with the data on the gene expression omnibus page in the supplementary file section, titled “GSE107489_gene_counts.txt.gz”. We performed exploratory analysis and differential gene expresssion analysis on the diabetic vs. wild-type mice for these counts to compare the results both with those from the published paper and with those received from the analyses using our counts matrix.
Both files are found in github in the FeatureCounts Files For Final Writeup folder.
All code for results can be found in the github folder FeatureCounts Analysis and DE Analysis in either the RMD or HTML files titled “Performing-Analysis-Using-Read-Count-Table”. The necessary code to generate the objects we used to create the figures below are shown in the following cells and annotated. The first cell below is the code used to create DESeq.rlog, which we use to make the PCA plot.
# Importing counts from our featureCounts run for all samples
readcounts <- paste0("final_project_featCounts.txt") %>% read.table(., header=TRUE)
# Rename the columns to specify sample
names(readcounts) <-c(names(readcounts)[1:6],paste("DIABETES",c(1:3), sep = "_"),paste("HETEROZYGOUS",c(1:3), sep = "_"),paste("WT",c(1:3),sep="_") )
# Make rownames the genes
row.names(readcounts) <- make.names(readcounts$Geneid)
# Exclude the columns without read counts (columns 1 to 6 contain additional info such as genomic coordinates)
readcounts <- readcounts[ ,-c(1:6)]
# Create a dataframe for the gene information for each sample
sample_info <- DataFrame(condition =gsub("_[0-9]+", "",names(readcounts)),row.names =names(readcounts) )
#Generate the DESeqDataSet from our counts for all the samples
DESeq.ds <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts), colData = sample_info, design = ~ condition)
# Keep only the genes which contain counts for any of the samples
keep_genes <- rowSums(counts(DESeq.ds)) > 0
DESeq.ds <- DESeq.ds[ keep_genes, ]
# Reduce the dependence of the variance on the mean using rlog
DESeq.rlog <-rlog(DESeq.ds, blind = TRUE)
Next, we show the necessary code to create the objects we used to generate the figures for the heterozygous vs. wild-type comparison.
## Take only the first 6 columns of the readcounts processed for all nine samples (heterozygous and wild-type)
readcounts_het_wt <- readcounts[ , c(4:9)]
# Create a dataframe for the gene information for each sample
sample_info_het_wt <- DataFrame(condition_het_wt =gsub("_[0-9]+", "",names(readcounts_het_wt)),row.names =names(readcounts_het_wt) )
#Generate the DESeqDataSet from our counts for the heterozygous and wild-type samples
DESeq.ds_het_wt <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts_het_wt), colData = sample_info_het_wt, design = ~ condition_het_wt)
# Keep only the genes which contain counts for the samples
keep_genes_het_wt <- rowSums(counts(DESeq.ds_het_wt)) > 0
DESeq.ds_het_wt <- DESeq.ds_het_wt[ keep_genes_het_wt, ]
# Change the names of the genes from the ensembl IDs to the external gene names
ens_het_wt <- c(rownames(DESeq.ds_het_wt))
ens_het_wt <- gsub('\\..*', '', ens_het_wt)
mouse = useMart(biomart="ensembl", dataset="mmusculus_gene_ensembl")
mapped_names <- getBM(attributes=c('ensembl_gene_id',
'external_gene_name'),
values = ens_het_wt,
mart = mouse)
ens_df_het_wt <- as.data.frame(ens_het_wt)
colnames(ens_df_het_wt) <- "ensembl_gene_id"
mapped_names_ordered_het_wt <- mapped_names[order(match(mapped_names[,1],ens_df_het_wt[,1])),]
idmap_het_wt = merge(x = ens_df_het_wt, y = mapped_names_ordered_het_wt, by="ensembl_gene_id", all.x=TRUE)
idmap_het_wt$external_gene_name <- ifelse(is.na(idmap_het_wt$external_gene_name), idmap_het_wt$ensembl_gene_id, idmap_het_wt$external_gene_name)
idmap_ordered_het_wt <- idmap_het_wt[order(match(idmap_het_wt[,1],ens_df_het_wt[,1])),]
row.names(DESeq.ds_het_wt) <- make.names(idmap_ordered_het_wt[,2])
# Reduce the dependence of the variance on the mean using rlog
DESeq.rlog_het_wt <-rlog(DESeq.ds_het_wt, blind = TRUE)
# Relevel the data so that the reference is the wild-type samples.
DESeq.ds_het_wt$condition_het_wt <- relevel(DESeq.ds_het_wt$condition_het_wt, ref = 'WT')
# Create a DESeqDataSet object for DE analyis
DESeq.ds_het_wt <- DESeq(DESeq.ds_het_wt)
# Extract a results table from a DESeq analysis
DGE.results_het_wt <- results(DESeq.ds_het_wt)
# Adjust for types of noise we might see using lfcShrink
DGE.results.shrink_het_wt <- lfcShrink(DESeq.ds_het_wt, coef = 2, type = 'apeglm')
# Find top 25 upregulated and top 9 downregulated genes which are significant given a threshold above 1 log fold change for upregulated genes and below -1 log fold change for downregulated genes
upregulated_genes_values_het_wt <- DGE.results.shrink_het_wt$log2FoldChange[DGE.results.shrink_het_wt$log2FoldChange > 1]
downregulated_genes_values_het_wt <- DGE.results.shrink_het_wt$log2FoldChange[DGE.results.shrink_het_wt$log2FoldChange < -1]
upregulated_genes_het_wt <- row.names(DGE.results.shrink_het_wt)[DGE.results.shrink_het_wt$log2FoldChange > 1]
upregulated_genes_values_het_wt_sorted <- upregulated_genes_values_het_wt[sort.list(upregulated_genes_values_het_wt)]
top_twentyfive_upregulated_genes_het_wt <- names(upregulated_genes_values_het_wt_sorted[1:25])
downregulated_genes_het_wt <- row.names(DGE.results.shrink_het_wt)[DGE.results.shrink_het_wt$log2FoldChange < -1]
downregulated_genes_values_het_wt_sorted <- downregulated_genes_values_het_wt[sort.list(downregulated_genes_values_het_wt)]
top_nine_downregulated_genes_het_wt <- names(downregulated_genes_values_het_wt_sorted[1:9])
# Find upregulated and downregulated values that are significant by p-value and log fold change
finding_upregulated_genes_in_significant_genes <- upregulated_genes_het_wt %in% DGEgenes_het_wt
final_upregulated_genes_het_wt <- upregulated_genes_het_wt[finding_upregulated_genes_in_significant_genes]
finding_downregulated_genes_in_significant_genes <- downregulated_genes_het_wt %in% DGEgenes_het_wt
final_downregulated_genes_het_wt <- downregulated_genes_het_wt[finding_downregulated_genes_in_significant_genes]
# Find the top 27 upregulated genes that are also significant
double_significant_upregulated_genes_het_wt <- DGE.results.shrink_het_wt[final_upregulated_genes_het_wt, ]
top_27_significant_and_upregulated_genes_het_wt <- rownames(double_significant_upregulated_genes_het_wt)[double_significant_upregulated_genes_het_wt$log2FoldChange > 1]
# Find the top 8 downregulated genes that are also significant
double_significant_downregulated_genes_het_wt <- DGE.results.shrink_het_wt[final_downregulated_genes_het_wt, ]
top_8_significant_and_downregulated_genes_het_wt <- rownames(double_significant_downregulated_genes_het_wt)[double_significant_downregulated_genes_het_wt$log2FoldChange < -1]
Next, we show the necessary code to create the objects we used to generate the figures for the heterozygous vs. diabetes comparison.
## Take only the first 6 columns processed for all nine samples (diabetes and heterozygous)
readcounts_het_db <- readcounts[ , c(1:6)]
# Create a dataframe for the gene information for each sample
sample_info_het_db <- DataFrame(condition_het_db =gsub("_[0-9]+", "",names(readcounts_het_db)),row.names =names(readcounts_het_db) )
#Generate the DESeqDataSet from our counts for the heterozygous and diabetes samples
DESeq.ds_het_db <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts_het_db), colData = sample_info_het_db, design = ~ condition_het_db)
# Keep only the genes which contain counts for the samples
keep_genes_het_db <- rowSums(counts(DESeq.ds_het_db)) > 0
DESeq.ds_het_db <- DESeq.ds_het_db[ keep_genes_het_db, ]
# Change the names of the genes from the ensembl IDs to the external gene names
ens_het_db <- c(rownames(DESeq.ds_het_db))
ens_het_db <- gsub('\\..*', '', ens_het_db)
mouse = useMart(biomart="ensembl", dataset="mmusculus_gene_ensembl")
mapped_names <- getBM(attributes=c('ensembl_gene_id',
'external_gene_name'),
values = ens_het_db,
mart = mouse)
ens_df_het_db <- as.data.frame(ens_het_db)
colnames(ens_df_het_db) <- "ensembl_gene_id"
mapped_names_ordered_het_db <- mapped_names[order(match(mapped_names[,1],ens_df_het_db[,1])),]
idmap_het_db = merge(x = ens_df_het_db, y = mapped_names_ordered_het_db, by="ensembl_gene_id", all.x=TRUE)
idmap_het_db$external_gene_name <- ifelse(is.na(idmap_het_db$external_gene_name), idmap_het_db$ensembl_gene_id, idmap_het_db$external_gene_name)
idmap_ordered_het_db <- idmap_het_db[order(match(idmap_het_db[,1],ens_df_het_db[,1])),]
row.names(DESeq.ds_het_db) <- make.names(idmap_ordered_het_db[,2])
# Reduce the dependence of the variance on the mean using rlog
DESeq.rlog_het_db <-rlog(DESeq.ds_het_db, blind = TRUE)
# Relevel the data so that the reference is the heterozygous samples.
DESeq.ds_het_db$condition_het_db <- relevel(DESeq.ds_het_db$condition_het_db, ref = 'HETEROZYGOUS')
# Create a DESeqDataSet object for DE analyis
DESeq.ds_het_db <- DESeq(DESeq.ds_het_db)
# Extract a results table from a DESeq analysis
DGE.results_het_db <- results(DESeq.ds_het_db)
# Adjust for types of noise we might see using lfcShrink
DGE.results.shrink_het_db <- lfcShrink(DESeq.ds_het_db, coef = 2, type = 'apeglm')
# Find top 50 upregulated and top 50 downregulated genes which are significant given a threshold above 1 log fold change for upregulated genes and below -1 log fold change for downregulated genes
upregulated_genes_values_het_db <- DGE.results.shrink_het_db$log2FoldChange[DGE.results.shrink_het_db$log2FoldChange > 1]
downregulated_genes_values_het_db <- DGE.results.shrink_het_db$log2FoldChange[DGE.results.shrink_het_db$log2FoldChange < -1]
upregulated_genes_het_db <- row.names(DGE.results.shrink_het_db)[DGE.results.shrink_het_db$log2FoldChange > 1]
upregulated_genes_values_het_db_sorted <- upregulated_genes_values_het_db[sort.list(upregulated_genes_values_het_db)]
top_fifty_upregulated_genes_het_db <- names(upregulated_genes_values_het_db_sorted[1:50])
downregulated_genes_het_db <- row.names(DGE.results.shrink_het_db)[DGE.results.shrink_het_db$log2FoldChange < -1]
downregulated_genes_values_het_db_sorted <- downregulated_genes_values_het_db[sort.list(downregulated_genes_values_het_db)]
top_fifty_downregulated_genes_het_db <- names(downregulated_genes_values_het_db_sorted[1:50])
# Find upregulated and downregulated values that are significant by p-value and log fold change
finding_upregulated_genes_in_significant_genes_het_db <- upregulated_genes_het_db %in% DGEgenes_het_db
final_upregulated_genes_het_db <- upregulated_genes_het_db[finding_upregulated_genes_in_significant_genes_het_db]
finding_downregulated_genes_in_significant_genes_het_db <- downregulated_genes_het_db %in% DGEgenes_het_db
final_downregulated_genes_het_db <- downregulated_genes_het_db[finding_downregulated_genes_in_significant_genes_het_db]
# Find the top 50 upregulated genes that are also significant
double_significant_upregulated_genes_het_db <- DGE.results.shrink_het_db[final_upregulated_genes_het_db, ]
top_50_significant_and_upregulated_genes_het_db <- rownames(double_significant_upregulated_genes_het_db)[double_significant_upregulated_genes_het_db$log2FoldChange > 6.07]
# Find the top 50 downregulated genes that are also significant
double_significant_downregulated_genes_het_db <- DGE.results.shrink_het_db[final_downregulated_genes_het_db, ]
top_50_significant_and_downregulated_genes_het_db <- rownames(double_significant_downregulated_genes_het_db)[double_significant_downregulated_genes_het_db$log2FoldChange < -2.48]
Next, we show the necessary code to create the objects we used to generate the figures for the diabetes vs. wild-type comparison. We also note that for the figures below, DGE.results_db_wt_go is the same as DESeq.ds_db_wt so no extra code creation is necessary to make DGE.results_db_wt_go.
## Take only the first 3 and last 3 columns processed for all nine samples (diabetes and wild-type)
readcounts_db_wt <- readcounts[ , c(1:3, 7:9)]
# Create a dataframe for the gene information for each sample
sample_info_db_wt <- DataFrame(condition_db_wt =gsub("_[0-9]+", "",names(readcounts_db_wt)),row.names =names(readcounts_db_wt) )
#Generate the DESeqDataSet from our counts for the diabetes and wild-type samples
DESeq.ds_db_wt <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts_db_wt), colData = sample_info_db_wt, design = ~ condition_db_wt)
# Keep only the genes which contain counts for the samples
keep_genes_db_wt <- rowSums(counts(DESeq.ds_db_wt)) > 0
DESeq.ds_db_wt <- DESeq.ds_db_wt[ keep_genes_db_wt, ]
# Change the names of the genes from the ensembl IDs to the external gene names
ens_db_wt <- c(rownames(DESeq.ds_db_wt))
ens_db_wt <- gsub('\\..*', '', ens_db_wt)
mouse = useMart("ensembl", dataset="mmusculus_gene_ensembl")
mapped_names <- getBM(attributes=c('ensembl_gene_id',
'external_gene_name'),
values = ens_db_wt,
mart = mouse)
ens_df_db_wt <- as.data.frame(ens_db_wt)
colnames(ens_df_db_wt) <- "ensembl_gene_id"
mapped_names_ordered_db_wt <- mapped_names[order(match(mapped_names[,1],ens_df_db_wt[,1])),]
idmap_db_wt = merge(x = ens_df_db_wt, y = mapped_names_ordered_db_wt, by="ensembl_gene_id", all.x=TRUE)
idmap_db_wt$external_gene_name <- ifelse(is.na(idmap_db_wt$external_gene_name), idmap_db_wt$ensembl_gene_id, idmap_db_wt$external_gene_name)
idmap_ordered_db_wt <- idmap_db_wt[order(match(idmap_db_wt[,1],ens_df_db_wt[,1])),]
rownames(DESeq.ds_db_wt) <- make.names(idmap_ordered_db_wt[,2])
# Reduce the dependence of the variance on the mean using rlog
DESeq.rlog_db_wt <-rlog(DESeq.ds_db_wt, blind = TRUE)
# Relevel the data so that the reference is the wild-type samples.
DESeq.ds_db_wt$condition_db_wt <- relevel(DESeq.ds_db_wt$condition_db_wt, ref = 'WT')
# Create a DESeqDataSet object for DE analyis
DESeq.ds_db_wt <- DESeq(DESeq.ds_db_wt)
# Extract a results table from a DESeq analysis
DGE.results_db_wt <- results(DESeq.ds_db_wt)
# Adjust for types of noise we might see using lfcShrink
DGE.results.shrink_db_wt <- lfcShrink(DESeq.ds_db_wt, coef = 2, type = 'apeglm')
# Find top 50 upregulated and top 50 downregulated genes which are significant given a threshold above 1 log fold change for upregulated genes and below -1 log fold change for downregulated genes
upregulated_genes_values_db_wt <- DGE.results.shrink_db_wt$log2FoldChange[DGE.results.shrink_db_wt$log2FoldChange > 1]
upregulated_genes_values_db_wt_sorted <- upregulated_genes_values_db_wt[sort.list(upregulated_genes_values_db_wt)]
top_fifty_upregulated_genes_db_wt <- names(upregulated_genes_values_db_wt_sorted[1:50])
downregulated_genes_values_db_wt <- DGE.results.shrink_db_wt$log2FoldChange[DGE.results.shrink_db_wt$log2FoldChange < -1]
downregulated_genes_values_db_wt_sorted <- downregulated_genes_values_db_wt[sort.list(downregulated_genes_values_db_wt)]
top_fifty_downregulated_genes_db_wt <- names(downregulated_genes_values_db_wt_sorted[1:50])
# Find upregulated and downregulated values that are significant by p-value and log fold change
upregulated_genes_db_wt <- row.names(DGE.results.shrink_db_wt)[DGE.results.shrink_db_wt$log2FoldChange > 1]
downregulated_genes_db_wt <- row.names(DGE.results.shrink_db_wt)[DGE.results.shrink_db_wt$log2FoldChange < -1]
finding_upregulated_genes_in_significant_genes_db_wt <- upregulated_genes_db_wt %in% DGEgenes_db_wt
final_upregulated_genes_db_wt <- upregulated_genes_db_wt[finding_upregulated_genes_in_significant_genes_db_wt]
finding_downregulated_genes_in_significant_genes_db_wt <- downregulated_genes_db_wt %in% DGEgenes_db_wt
final_downregulated_genes_db_wt <- downregulated_genes_db_wt[finding_downregulated_genes_in_significant_genes_db_wt]
# Find the top 50 upregulated genes that are also significant
double_significant_upregulated_genes <- DGE.results.shrink_db_wt[final_upregulated_genes_db_wt, ]
top_50_significant_and_upregulated_genes <- rownames(double_significant_upregulated_genes)[double_significant_upregulated_genes$log2FoldChange > 5.6]
# Find the top 50 downregulated genes that are also significant
double_significant_downregulated_genes <- DGE.results.shrink_db_wt[final_downregulated_genes_db_wt, ]
top_50_significant_and_downregulated_genes <- rownames(double_significant_downregulated_genes)[double_significant_downregulated_genes$log2FoldChange < -2.42]
Next, we show the necessary code to create the objects we used to generate the figures for the diabetes vs. wild-type comparison from the publication counts.
# Importing counts from our featureCounts run for all samples from publication counts
readcounts_paper <- paste0("featurecounts_from_paper.txt") %>% read.table(., header=TRUE)
## Take only the columns for the diabetes and wild-type samples in order so that the first 3 are wild-type and last 3 are diabetes samples
readcounts_paper_db_wt <- readcounts_paper[ , c(1, 5:6, 8, 7, 9:10)]
# Create a dataframe for the gene information for each sample
sample_info_paper_db_wt <- DataFrame(condition_paper_db_wt =gsub("_[0-9]+", "",names(readcounts_paper_db_wt)),row.names =names(readcounts_paper_db_wt) )
#Generate the DESeqDataSet from our counts for the diabetes and wild-type samples
DESeq.ds_paper_db_wt <- DESeqDataSetFromMatrix(countData = as.matrix(readcounts_paper_db_wt), colData = sample_info_paper_db_wt, design = ~ condition_paper_db_wt)
# Keep only the genes which contain counts for the samples
keep_genes_paper_db_wt <- rowSums(counts(DESeq.ds_paper_db_wt)) > 0
DESeq.ds_paper_db_wt <- DESeq.ds_paper_db_wt[ keep_genes_paper_db_wt, ]
# Reduce the dependence of the variance on the mean using rlog
DESeq.rlog_paper_db_wt <-rlog(DESeq.ds_paper_db_wt, blind = TRUE)
# Relevel the data so that the reference is the wild-type samples.
DESeq.ds_paper_db_wt$condition_paper_db_wt <- relevel(DESeq.ds_paper_db_wt$condition_paper_db_wt, ref = 'WT')
# Create a DESeqDataSet object for DE analyis
DESeq.ds_paper_db_wt <- DESeq(DESeq.ds_paper_db_wt)
# Extract a results table from a DESeq analysis
DGE.results_paper_db_wt <- results(DESeq.ds_paper_db_wt)
# Adjust for types of noise we might see using lfcShrink
DGE.results.shrink_paper_db_wt <- lfcShrink(DESeq.ds_paper_db_wt, coef = 2, type = 'apeglm')
# Find top 50 upregulated and top 50 downregulated genes which are significant given a threshold above 1 log fold change for upregulated genes and below -1 log fold change for downregulated genes
upregulated_genes_values_paper_db_wt <- DGE.results.shrink_paper_db_wt$log2FoldChange[DGE.results.shrink_paper_db_wt$log2FoldChange > 1]
upregulated_genes_values_paper_db_wt_sorted <- upregulated_genes_values_paper_db_wt[sort.list(upregulated_genes_values_paper_db_wt)]
top_fifty_upregulated_genes_paper_db_wt <- names(upregulated_genes_values_paper_db_wt_sorted[1:50])
downregulated_genes_values_paper_db_wt <- DGE.results.shrink_paper_db_wt$log2FoldChange[DGE.results.shrink_paper_db_wt$log2FoldChange < -1]
downregulated_genes_values_paper_db_wt_sorted <- downregulated_genes_values_paper_db_wt[sort.list(downregulated_genes_values_paper_db_wt)]
top_fifty_downregulated_genes_paper_db_wt <- names(downregulated_genes_values_paper_db_wt_sorted[1:50])
# Find upregulated and downregulated values that are significant by p-value and log fold change
upregulated_genes_paper_db_wt <- row.names(DGE.results.shrink_paper_db_wt)[DGE.results.shrink_paper_db_wt$log2FoldChange > 1]
downregulated_genes_paper_db_wt <- row.names(DGE.results.shrink_paper_db_wt)[DGE.results.shrink_paper_db_wt$log2FoldChange < -1]
# Find upregulated and downregulated values that are significant by p-value and log fold change
finding_upregulated_genes_in_significant_genes_paper_db_wt <- upregulated_genes_paper_db_wt %in% DGEgenes_paper_db_wt
final_upregulated_genes_paper_db_wt <- upregulated_genes_paper_db_wt[finding_upregulated_genes_in_significant_genes_paper_db_wt]
finding_downregulated_genes_in_significant_genes_paper_db_wt <- downregulated_genes_paper_db_wt %in% DGEgenes_paper_db_wt
final_downregulated_genes_paper_db_wt <- downregulated_genes_paper_db_wt[finding_downregulated_genes_in_significant_genes_paper_db_wt]
# Find the top 50 upregulated genes that are also significant
double_significant_upregulated_genes_paper <- DGE.results.shrink_paper_db_wt[final_upregulated_genes_paper_db_wt, ]
top_50_significant_and_upregulated_genes_paper <- rownames(double_significant_upregulated_genes_paper)[double_significant_upregulated_genes_paper$log2FoldChange > 5.6]
# Find the top 50 downregulated genes that are also significant
double_significant_downregulated_genes_paper <- DGE.results.shrink_paper_db_wt[final_downregulated_genes_paper_db_wt, ]
top_50_significant_and_downregulated_genes_paper <- rownames(double_significant_downregulated_genes_paper)[double_significant_downregulated_genes_paper$log2FoldChange < -2.42]
Below we form a table for the number of significant genes by p-value, top 3 most significant genes, number of significant genes by p-value and log 2 FC, and number of upregulated and downregulated genes, for all comparisons between samples from our counts. A gene was significant by p-value if the gene’s adjusted p-value was smaller than 0.05. A gene was significant with respect to fold change if it had a log2 fold change value greater than 1, which represented a significant upregulated gene, or less than -1, which represented a significant downregulated gene. We also include these results for the comparison of diabetic vs. wild-type mice using the counts from the published paper.
| Source Of Counts Used | Samples Compared | Number of Significant Genes By P-value | Top 3 Significant Genes By P-value | Number of Significant Genes By P-value and Log2 FC | Number of Upregulated/Downregulated Genes Significant By P-value and Log2 FC |
|---|---|---|---|---|---|
| Our Counts | HET vs. WT | 672 | Lars2, ENSMUSG00000106106 (CT010467.1), Gm15564 | 35 | 27/8 |
| Our Counts | HET vs. DB | 5993 | Aldh1a3, Serpina7, Gc | 1995 | 1492/503 |
| Our Counts | DB vs. WT | 5948 | Aldh1a3, Serpina7, Gc | 2092 | 1619/473 |
| Publication Counts | DB vs. WT | 5575 | Serpina7, Aldh1a3, Npas4 | 1827 | 1354/473 |
Below we show the PCA plot resulting from our data. This plot is very similar to the PCA shown in the original paper, as all three samples are distinct. There is one diabetes sample that is further from the other two, but in relation to all samples, this diabetes sample is much closer to the other diabetes samples than the heterozygous or wild type samples. This means that the samples have distinct features that separate them from the other types of mice in the experiment.
The code used to create the PCA plot is shown below, followed by importing the PCA plot.
# We set ntop = 500 (the default) to show this represents the top 500 most variable genes used by the PCA
plotPCA(DESeq.rlog, ntop = 500) + theme_bw()
Figure 9. PCA plot showing clustering of samples by phenotype.
Throughout the analysis, we created multiple heatmaps for different sets of genes between all comparisons. These heatmaps correspond to all significant genes by p-value, the top upregulated genes also significant by p-value, and the top downregulated genes also significant by p-value. All heatmaps were produced using the rlog transformed data.
The code used to create the heatmap for all significant genes for heterozygous vs. wild-type is shown below, followed by importing the heatmap.
DGEgenes_het_wt <- subset(DGE.results_het_wt, padj < 0.05) %>% rownames
row.names(DESeq.rlog_het_wt) <- rownames(DGE.results_het_wt)
rlog.dge_het_wt <- DESeq.rlog_het_wt[DGEgenes_het_wt, ] %>% assay
z.mat <- t(scale(t(rlog.dge_het_wt), center=TRUE, scale=TRUE))
colnames(z.mat) = c('HET_1', 'HET_2', 'HET_3', 'WT_1', 'WT_2', 'WT_3')
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
show_row_name = FALSE,
cluster_columns = FALSE,
split=cutGroups,
column_title = "Heatmap For HET vs. WT Significant Genes")
Figure 10. Heatmap for all significant genes for heterozygous vs. wild-type.
In the publication, the authors showed the top 25 upregulated and downregulated genes for the comparison between heterozygous and wild-type. However, they did not specify a log2 FC cutoff which made the genes significant in respect to fold change. It is likely that the authors used genes without log2 fold changes greater than 1 for upregulated genes or less than -1 for downregulated genes. However, we believe that only genes that meet these standards should be included because these show a significant upregulation or downregulation between samples. Therefore, we were able to produce a heatmap for the top 27 upregulated genes, which contained all significant genes by p-value also upregulated over a log2 fold change greater than 1, but only the top 8 downregulated genes satisfied our condition. These genes are shown in the downregulated genes heatmap below. The code used to create the heatmaps for the top 27 upregulated genes and top 8 downregulated genes for heterozygous vs. wild-type are shown below, followed by importing the heatmaps.
DGEgenes_het_wt_top_twenty_seven_upregulated <- top_27_significant_and_upregulated_genes_het_wt
rlog.dge_het_wt_top_twenty_seven_upregulated <- DESeq.rlog_het_wt[DGEgenes_het_wt_top_twenty_seven_upregulated, ] %>% assay
z.mat <- t(scale(t(rlog.dge_het_wt_top_twenty_seven_upregulated), center=TRUE, scale=TRUE))
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
column_title = "HET vs. WT: Top 27 Significant and Upregulated Genes",
column_title_gp = gpar(fontsize = 10),
show_row_name = TRUE,
cluster_columns = FALSE,
split=cutGroups)
Figure 11. Heatmap for the top 27 upregulated genes for heterozygous vs. wild-type.
DGEgenes_het_wt_top_eight_downregulated <- top_8_significant_and_downregulated_genes_het_wt
rlog.dge_het_wt_top_eight_downregulated <- DESeq.rlog_het_wt[DGEgenes_het_wt_top_eight_downregulated, ] %>% assay
z.mat <- t(scale(t(rlog.dge_het_wt_top_eight_downregulated), center=TRUE, scale=TRUE))
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
column_title = "HET vs. WT: Top 8 Significant and Downregulated Genes",
column_title_gp = gpar(fontsize = 10),
show_row_name = TRUE,
cluster_columns = FALSE,
split=cutGroups)
Figure 12. Heatmap for the top 8 downregulated genes for heterozygous vs. wild-type.
The code used to create the heatmap for all significant genes for heterozygous vs. diabetic is shown below, followed by importing the heatmap.
DGEgenes_het_db <- subset(DGE.results_het_db, padj < 0.05) %>% rownames
row.names(DESeq.rlog_het_db) <- rownames(DGE.results_het_db)
rlog.dge_het_db <- DESeq.rlog_het_db[DGEgenes_het_db, ] %>% assay
z.mat <- t(scale(t(rlog.dge_het_db), center=TRUE, scale=TRUE))
colnames(z.mat) = c('DB_1', 'DB_2', 'DB_3', 'HET_1', 'HET_2', 'HET_3')
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
show_row_name = FALSE,
cluster_columns = FALSE,
split=cutGroups,
column_title = "Heatmap For DB vs. HET Significant Genes")
Figure 13. Heatmap for all significant genes for heterozygous vs. diabetes.
When comparing the heterozygous and diabetic conditions, there are many more significant genes than that of the heterozygous vs wild-type. Therefore, it was possible to create heat maps displaying the differences between the top 50 upregulated and downregulated genes. The code used to create the heatmaps for the top 50 upregulated genes and top 50 downregulated genes for heterozygous vs. diabetic are shown below, followed by importing the heatmaps.
DGEgenes_het_db_top_fifty_upregulated <- top_50_significant_and_upregulated_genes_het_db
rlog.dge_het_db_top_fifty_upregulated <- DESeq.rlog_het_db[DGEgenes_het_db_top_fifty_upregulated, ] %>% assay
z.mat <- t(scale(t(rlog.dge_het_db_top_fifty_upregulated), center=TRUE, scale=TRUE))
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
column_title = "HET vs. DB: Top 50 Significant and Upregulated Genes",
column_title_gp = gpar(fontsize = 10),
show_row_name = TRUE,
cluster_columns = FALSE,
split=cutGroups)
Figure 14. Heatmap for the top 50 upregulated genes for heterozygous vs. diabetes.
DGEgenes_het_db_top_fifty_downregulated <- top_50_significant_and_downregulated_genes_het_db
rlog.dge_het_db_top_fifty_downregulated <- DESeq.rlog_het_db[DGEgenes_het_db_top_fifty_downregulated, ] %>% assay
z.mat <- t(scale(t(rlog.dge_het_db_top_fifty_downregulated), center=TRUE, scale=TRUE))
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
column_title = "HET vs. DB: Top 50 Significant and Downregulated Genes",
column_title_gp = gpar(fontsize = 10),
show_row_name = TRUE,
cluster_columns = FALSE,
split=cutGroups)
Figure 15. Heatmap for the top 50 downregulated genes for heterozygous vs. diabetes.
The code used to create the heatmap for all significant genes for diabetic vs. wild-type is shown below, followed by importing the heatmap.
DGEgenes_db_wt <- subset(DGE.results_db_wt, padj < 0.05) %>% rownames
row.names(DESeq.rlog_db_wt) <- rownames(DGE.results_db_wt)
rlog.dge_db_wt <- DESeq.rlog_db_wt[DGEgenes_db_wt, ] %>% assay
z.mat <- t(scale(t(rlog.dge_db_wt), center=TRUE, scale=TRUE))
colnames(z.mat) <- c('DB_1', 'DB_2', 'DB_3', 'WT_1', 'WT_2', 'WT_3')
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
show_row_name = FALSE,
cluster_columns = FALSE,
split=cutGroups,
column_title = "Heatmap For DB vs. WT Significant Genes")
Figure 16. Heatmap for all significant genes for diabetes vs. wild-type.
Just as in the case of the heterozygous vs diabetic groups, when comparing the diabetic and wild-type conditions, there are many more significant genes than that of the heterozygous vs wild-type. Following the same pipeline, it was possible to create heat maps displaying the differences between the top 50 upregulated and downregulated genes between the diabetic and wild-type samples. The code used to create the heatmaps for the top 50 upregulated genes and top 50 downregulated genes for diabetic vs. wild-type are shown below, followed by importing the heatmaps.
DGEgenes_db_wt_top_fifty_upregulated <- top_50_significant_and_upregulated_genes
rlog.dge_db_wt_top_fifty_upregulated <- DESeq.rlog_db_wt[DGEgenes_db_wt_top_fifty_upregulated, ] %>% assay
z.mat <- t(scale(t(rlog.dge_db_wt_top_fifty_upregulated), center=TRUE, scale=TRUE))
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
column_title = "DB vs. WT: Top 50 Significant and Upregulated Genes",
column_title_gp = gpar(fontsize = 10),
show_row_name = TRUE,
cluster_columns = FALSE,
split=cutGroups)
Figure 17. Heatmap for the top 50 upregulated genes for diabetes vs. wild-type.
DGEgenes_db_wt_top_fifty_downregulated <- top_50_significant_and_downregulated_genes
rlog.dge_db_wt_top_fifty_downregulated <- DESeq.rlog_db_wt[DGEgenes_db_wt_top_fifty_downregulated, ] %>% assay
z.mat <- t(scale(t(rlog.dge_db_wt_top_fifty_downregulated), center=TRUE, scale=TRUE))
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
column_title = "DB vs. WT: Top 50 Significant and Downregulated Genes",
column_title_gp = gpar(fontsize = 10),
show_row_name = TRUE,
cluster_columns = FALSE,
split=cutGroups)
Figure 18. Heatmap for the top 50 downregulated genes for diabetes vs. wild-type.
Next, we created the heatmaps for the diabetic vs. wild-type samples using the counts from the publication. We did this so that we could compare the results directly with our analysis and to demonstrate if our pipeline produced the same results as that used in the publication, which was never mentioned. The code used to create the heatmap using the counts from the publication for all significant genes for diabetic vs. wild-type is shown below, followed by importing the heatmap.
DGEgenes_paper_db_wt <- subset(DGE.results_paper_db_wt, padj < 0.05) %>% rownames
row.names(DESeq.rlog_paper_db_wt) <- rownames(DGE.results_paper_db_wt)
rlog.dge_paper_db_wt <- DESeq.rlog_paper_db_wt[DGEgenes_paper_db_wt, ] %>% assay
z.mat <- t(scale(t(rlog.dge_paper_db_wt), center=TRUE, scale=TRUE))
colnames(z.mat) <- c('WT_1', 'WT_2', 'WT_3', 'DB_1', 'DB_2', 'DB_3')
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
show_row_name = FALSE,
cluster_columns = FALSE,
split=cutGroups,
column_title = "Heatmap For DB vs. WT Significant Genes")
Figure 19. Heatmap for all significant genes for diabetes vs. wild-type from publication counts.
Following the process used above for our counts, we were able to create a heatmap for the top 50 upregulated and top 50 downregulated genes between the diabetic and wild-type samples. The code used to create the heatmaps using the counts from the publication for the top 50 upregulated genes and top 50 downregulated genes for diabetic vs. wild-type are shown below, followed by importing the heatmaps.
DGEgenes_paper_db_wt_top_fifty_upregulated <- top_50_significant_and_upregulated_genes_paper
rlog.dge_paper_db_wt_top_fifty_upregulated <- DESeq.rlog_paper_db_wt[DGEgenes_paper_db_wt_top_fifty_upregulated, ] %>% assay
z.mat <- t(scale(t(rlog.dge_paper_db_wt_top_fifty_upregulated), center=TRUE, scale=TRUE))
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
column_title = "DB vs. WT: Top 50 Significant and Upregulated Genes",
column_title_gp = gpar(fontsize = 10),
show_row_name = TRUE,
cluster_columns = FALSE,
split=cutGroups)
Figure 20. Heatmap for the top 50 upregulated genes for diabetes vs. wild-type from publication counts.
DGEgenes_paper_db_wt_top_fifty_downregulated <- top_50_significant_and_downregulated_genes_paper
rlog.dge_paper_db_wt_top_fifty_downregulated <- DESeq.rlog_paper_db_wt[DGEgenes_paper_db_wt_top_fifty_downregulated, ] %>% assay
z.mat <- t(scale(t(rlog.dge_paper_db_wt_top_fifty_downregulated), center=TRUE, scale=TRUE))
hcDat <- hclust(dist(z.mat))
cutGroups <- cutree(hcDat, h=4)
Heatmap(z.mat, name = "z-score",
column_title = "DB vs. WT: Top 50 Significant and Downregulated Genes",
column_title_gp = gpar(fontsize = 10),
show_row_name = TRUE,
cluster_columns = FALSE,
split=cutGroups)
Figure 21. Heatmap for the top 50 downregulated genes for diabetes vs. wild-type from publication counts.
The results from the heatmap for the counts from the publication do have some overlap with those output by the publication, but do not agree exaclty. We attempted to figure out why this was the case. In the publication, the authors specify that the heatmaps represent “The most significant 50 genes in terms of increased expression…”, which we interpreted as the greatest log2 Fold changes for genes that were found to be significant by adjusted p-value. However, in the paper, the gene Gc is included in the Top 50 genes in terms of increased expression. Below we make a table showing that Gc should not be in the top 50 upregulated genes given our assumption about what increased expression is intended to mean. The Gc gene would represent the 140th most significant gene in terms of increased expression. This suggests to us that the authors included this gene in their top 50 most important genes because they put a larger weight on adjusted p-value or filtered out genes that they did not believe had any relevance and thus Gc was included. This suggests that other genes would also be scored differently than just using the log2 FC value for upregulation and downregulation. Therefore, our Top 50 upregulated and Top 50 downregulated genes should not match exactly with those from the paper because we created our heatmaps based on log FC values as the metric for significance. However, we found that there were at least one overlapping gene with the heatmaps shown in the publication, showing that our results are not far off from those of the authors. Unfortunatley, this represents a lack of communication in scientific research and an outcome where the results were not able to be properly duplicated.
| Rank In Upregulated Genes By log2 FC | Gene | log2 FC | padj |
|---|---|---|---|
| 1 | X1810009J06Rik | 14.39679 | 3.6617e-14 |
| 50 | X1810006J02Rik | 5.245285 | 4.94998e-23 |
| 140 | Gc | 3.44248 | 2.41211e-233 |
We have created a volcano plot for all comparisons of samples using the counts generated in our pipeline. We also generated a volcano plot for the comparison of diabetic vs. wild-type mice using the counts from the publication. All volcano plots were made using the shrunken log2 fold changes.
For the heterozygous vs. wild-type comparison, we created two volcano plots. The reason for this is because the original volcano plot showed genes with p-values that are extremely small, and therefore the whole graph is stretched, making it hard to see the other significant genes. We thus created a second volcano plot, restricting the height and width on the axes, in order to display more significant genes. The code used to create the volcano plots for heterozygous vs. wild-type is shown below, followed by importing the volcano plots.
vp2_het_wt <- EnhancedVolcano(DGE.results.shrink_het_wt,lab =rownames(DGE.results.shrink_het_wt),x ='log2FoldChange',y ='padj', pCutoff = 0.05,title = "HET vs WT: with logFC shrinkage")
Figure 22. Volcano plot for heterozygous vs. wild-type showing all genes.
vp2_het_wt_with_cutoff <- EnhancedVolcano(DGE.results.shrink_het_wt,lab =rownames(DGE.results.shrink_het_wt),x ='log2FoldChange',y ='padj', pCutoff = 0.05,title = "HET vs WT: with logFC shrinkage", xlim=c(-5,5), ylim = c(0,20))
Figure 23. Volcano plot for heterozygous vs. wild-type with restricted height and width to show more genes.
The code used to create the volcano plots for heterozygous vs. diabetic is shown below, followed by importing the volcano plot.
vp2_het_db <-EnhancedVolcano(DGE.results.shrink_het_db,lab=rownames(DGE.results.shrink_het_db),x ='log2FoldChange',y ='padj', pCutoff = 0.05,title = "DB vs HET: with logFC shrinkage")
Figure 24. Volcano plot for heterozygous vs. diabetes.
The code used to create the volcano plots for diabetic vs. wild-type is shown below, followed by importing the volcano plot.
vp2_db_wt <-EnhancedVolcano(DGE.results.shrink_db_wt,lab=rownames(DGE.results.shrink_db_wt),x ='log2FoldChange',y ='padj', pCutoff = 0.05,title = "DB vs WT: with logFC shrinkage")
Figure 25. Volcano plot for diabetes vs. wild-type.
The code used to create the volcano plot for diabetic vs. wild-type is shown below from the counts from the publication, followed by importing the volcano plot.
vp2_paper_db_wt <-EnhancedVolcano(DGE.results.shrink_paper_db_wt, lab =rownames(DGE.results.shrink_paper_db_wt),x ='log2FoldChange',y ='padj', pCutoff = 0.05,title = "DB vs WT: with logFC shrinkage")
Figure 26. Volcano plot for diabetes vs. wild-type from publication counts.
We noticed that the volcano plot generated above using the publication counts does not exactly match that reported in the paper. Therefore, we concluded that although specifying the use of DESeq2, the pipeline the author’s method used to create their volcano plot is not the same as ours. Since we used the shrunken log2 fold changes, we believe that our volcano plot is better representative of the true outcome because the shrunken values tend to be more conservative and match the expectations of a large sample size, which is a property of this experiment. We also found that using our counts, the volcano plot for diabetic vs. wild-type samples was extremely similar to that of the volcano plot produced using the counts from the publication. This validated that our pipeline throughout the project produced the expected results.
In addition to the heatmaps and volcano plots, we visualized significant genes using an MA plot for the diabetes and wild-type comparison for both the counts generated using our pipeline and the counts from the published paper. Both MA plots used the shrunken data. The first MA plot corresponds to the data from our counts, and the second MA plot corresponds to the data from the publication’s counts. Comparing these two figures, we find that there are no large discrepancies. The code used to create the MA plots for diabetic vs. wild-type are shown below for both our counts and from the counts from the publication, followed by importing the MA plots.
DGE.results.shrink_db_wt <-lfcShrink(DESeq.ds_db_wt, coef = 2, type = "apeglm")
plotMA(DGE.results.shrink_db_wt,alpha = 0.05, main = "with logFC shrinkage", ylim =c(-3,3))
abline(h = 1, col = 'red')
abline(h = -1, col = 'red')
Figure 27. MA plot for diabetes vs. wild-type.
DGE.results.shrink_paper_db_wt <- lfcShrink(DESeq.ds_paper_db_wt, coef = 2, type = "apeglm")
DESeq2::plotMA(DGE.results.shrink_paper_db_wt,alpha = 0.05, main = "with logFC shrinkage", ylim =c(-3,3))
abline(h = 1, col = 'red')
abline(h = -1, col = 'red')
Figure 28. MA plot for diabetes vs. wild-type from publication counts.
We performed gene ontology (GO) analysis in order to explore the functions of the statistically significant genes by p-value for the diabetic vs. wild-type comparison. We used the clusterProfiler library, recommended by Merv, to automate the process of biological-term classification. The function enrichGO calculates the enrichment test for GO terms. To visualize the most important biological processes from the significant genes, we created a dot plot of the 30 most significant processes. Below we show the code used to produce the dot plot, followed by importing the dot plot. Many of these biological processes seemed related to mechanisms involved in diabetes, such as protein localization to extracellular region and protein secretion, which have to do with the mechanics of insulin. Also, it is worthy to note that insulin secretion is included in these biological processes, which we would expect since diabetic mice are insulin suppressed.
# Change rownames so that they are readable for the GO analysis by removing all decimals and numbers after
DGEgenes_db_wt_go <- subset(DGE.results_db_wt_go, padj < 0.05) %>% rownames
DGEgenes_db_wt_go <- gsub("\\..*","",DGEgenes_db_wt_go)
gene_ontology_results <- enrichGO(gene = DGEgenes_db_wt_go,
OrgDb = org.Mm.eg.db,
keyType = 'ENSEMBL',
ont = "BP",
pAdjustMethod = "BH")
dotplot(gene_ontology_results, showCategory=30)
plotGOgraph(gene_ontology_results)
Figure 29. Results of GO analysis for diabetes vs. wild-type showing top 30 most significant biological processes.
In the paper, the authors aimed to establish db/db as an animal model for dedifferentiation by using RNA sequencing to compare the gene expression profile in islets isolated from wild-type, db/+ and db/db mice, and qPCR was performed to validate those significant genes. They concluded a reduction in both insulin secretion and the expression of Ins1, Ins2, Glut2, Pdx1 and MafA was indicative of dedifferentiation in db/db islets. Therefore, we have compared our results for these genes using both our counts matrix and the counts matrix of the publication, to see if these genes are found to be significantly downregulated in the diabetic samples when compared with the wild-type samples.
We found that using our counts matrix, we did not receive significant adjusted p-values or large shrunken log fold change values for the Ins1 and Ins2 genes. The results for our counts are tabulated below, along with the code used to generate them.
| Gene | Shrunken Log FC Value | Adjusted p-value |
|---|---|---|
| Ins1 | -0.417222 | 0.251649 |
| Ins2 | 0.0374102 | 0.874441 |
| Glut2 (Slc2a2) | -2.86349 | 3.26316e-09 |
| Pdx1 | -1.13373 | 7.2644e-30 |
| MafA | -3.5507 | 1.08515e-27 |
DGE.results.shrink_db_wt['Ins1', ]
DGE.results.shrink_db_wt['Ins2', ]
DGE.results.shrink_db_wt['Slc2a2', ] # AKA Glut2
DGE.results.shrink_db_wt['Pdx1', ]
DGE.results.shrink_db_wt['Mafa', ]
Using the counts matrix from the publication, we found consistent results with our count matrix output. In both analyses, Pdx1, MafA, and Glut2 are shown to be significant at our threshold log2 FC < -1. These three genes are also significant with respect to adjusted p-value as well. The results for the publication’s counts are tabulated below, along with the code used to generate them.
| Gene | Shrunken Log FC Value | Adjusted p-value |
|---|---|---|
| Ins1 | -0.319718 | 0.35666 |
| Ins2 | 0.0172055 | 0.944685 |
| Glut2 (Slc2a2) | -2.80892 | 1.36774e-12 |
| Pdx1 | -1.02596 | 6.89333e-24 |
| MafA | -3.15191 | 1.22448e-18 |
DGE.results.shrink_paper_db_wt['Ins1', ]
DGE.results.shrink_paper_db_wt['Ins2', ]
DGE.results.shrink_paper_db_wt['Slc2a2', ] # AKA Glut2
DGE.results.shrink_paper_db_wt['Pdx1', ]
DGE.results.shrink_paper_db_wt['Mafa', ]
Based on these results, we concluded that the db/db model is not as strong of an animal model for dedifferentiation as concluded by the paper, due to the lack of downregulation in Ins1 and Ins2 in diabetic samples demonstrated by the downstream analyses. One would expect these genes to be significantly downregulated for this animal model to best demonstrate use as a model for dedifferentiation.
One of the questions that was not addressed in the paper was the sex of the mice. It has been shown that human sexes differ in their responses to diabetes. Women have greater increases of cardiovascular risk, myocardial infarction, and stroke mortality than men, compared with nondiabetic subjects.\(^8\) Therefore, we had reason to suspect that it is possible male and female diabetic mice may have different responses to diabetes, possibly even in at a genetic level. To determine the sex of the mice, we utilized the QoRTs output files. The chromosome distribution (excluding autosomes) graph, shown below for diabetes sample 1, demonstrated that almost no reads were mapped to the Y chromosome of the mice. Thus, the diabetes sample 1 mouse was female. One reason why there were a few reads mapped to the Y chromosome could have been due to multi-mapped reads on both the X and Y chromosomes since there are many regions that are identical. We found that all samples used in this experiment were females. This could be a potential source of bias, due to the fact that males and females may express differences in their transcriptomes. Future experiments should include both male and female mice to capture potential differences resulting from sex.
Figure 30. QoRTs plot of chromosome distribution excluding autosomes for diabetes sample 1 showing that this mouse is a female.
This experiment used monogenic db/db mice to represent diabetic samples. Although monogenic diabetes has been found in humans, it is a rare form of diabetes. The most common types of diabetes are caused by multiple genes, and in T2D lifestyle factors such as obesity also influence diabetes progression.\(^{10}\) Therefore, it is possible that using a polygenic mouse strain, such as KK mice or NZO mice, may correlate better with the common diabetic condition of T2D in humans.
At the beginning of our QC, we discovered that the diabetes samples, especially diabetes sample 3, had a large amount of overrepresented sequences. Through the use of FastQC, we were able to identify these overrepresented sequences and use BLAST to identify possible genes that, if found later in the downstream analysis, would be artifacts of this bias. We believe that the reason for these overrepresented sequences could be due to increased PCR amplification in the diabetes samples, especially diabetes sample 3, compared to the heterozygous and wild-type samples. However, we did not find the genes that were potential problems output from BLAST in our downstream analyses.
After analyses of our counts was completed, we noticed that some of our results did not match those of the published paper. We therefore performed the same analyses using the counts directly from the authors of the publication. After finishing all analyses using these counts, we still found discrepancies between the results from our pipeline and those in the publication. These are exemplified by the contrast of the top 50 upregulated and top 50 downregulated genes displayed in the heatmaps for the diabetic and wild-type comparison between the analysis we performed on the counts from the publication, and the heatmaps found in the publication. In addition, the volcano plot we produced using the counts from the publication does not closely resemble the published volcano plot for the diabetic vs. wild-type comparison. As a result of the authors not including any information on how they performed their analyses other than including that they incorporate DESeq2, we believe this is a prime example of the lack of proper communication and reproducibility in science.
The original names of our genes were represented by the Ensembl gene names. In order to compare our results with those from the publication, we had to change these names to the external gene names using the biomaRt library. However, we discovered that not all Ensembl gene names had a corresponding external gene name. Therefore, these genes would result in a value of NA, which was not able to be deciphered for downstream analysis. To overcome this problem, we first converted all the ensembl gene names to external gene names, followed by renaming all NA values back to the ensembl gene name. This allowed us to perform the rest of our analyses and directly compare the same gene names with those output in the results of the publication.
| Name of Data Set | Description |
|---|---|
| final_project_featCounts (Location in SCU where it is called final_project_featCounts_mouse_genes_stranded_after_comments_from_merv.txt - /home/nib4003/ANGSD_2021_hw/final_project/counting_reads_for_all_samples) | This file contains the counts created using the pipeline in the methods. Its results are used in our exploratory analyses and differential expression analyses for the comparisons of heterozygous vs. wild-type, heterozygous vs. diabetic, and diabetic vs. wild-type mice samples, as well as gene ontology analysis for the diabetic vs. wild-type mice samples. |
| featurecounts_from_paper.txt (Location in SCU - /home/nib4003/ANGSD_2021_hw/final_project/counting_reads_for_all_samples) | This file contains the counts created by the authors of the publication. Its results are used to attempt the replication of some results from the publication for the diabetic vs. wild-type comparison. |
| BAM Files (Location - /home/nib4003/ANGSD_2021_hw/final_project/*aligned_star_reads where the beginning of the folder the sample name): diabetes_sample1_alignments.Aligned.sortedByCoord.out.bam diabetes_sample2_alignments.Aligned.sortedByCoord.out.bam diabetes_sample3_alignments.Aligned.sortedByCoord.out.bam heterozygous_sample1_alignments.Aligned.sortedByCoord.out.bam heterozygous_sample2_alignments.Aligned.sortedByCoord.out.bam heterozygous_sample3_alignments.Aligned.sortedByCoord.out.bam wt_sample1_alignments.Aligned.sortedByCoord.out.bam wt_sample2_alignments.Aligned.sortedByCoord.out.bam wt_sample3_alignments.Aligned.sortedByCoord.out.bam | Aligned fastq files to reference genome using STAR. Each file represents one of the samples used in the experiment. These files are used to create our count matrix through the use of featureCounts. |
Editor. “Diabetes Prevalence.” Diabetes.co.uk, 15 Jan. 2019, www.diabetes.co.uk/diabetes-prevalence.html#:~:text=UK%20diabetes%20prevalence&text=This%20means%20that%2C%20including%20the,diabetes%20(diagnosed%20and%20undiagnosed).
Khan, Moien Abdul Basith, et al. “Epidemiology of Type 2 Diabetes - Global Burden of Disease and Forecasted Trends.” Journal of Epidemiology and Global Health, Atlantis Press, Mar. 2020, www.ncbi.nlm.nih.gov/pmc/articles/PMC7310804/.
Mohammed-Ali, Zahraa, et al. “Animal Models of Kidney Disease.” Animal Models for the Study of Human Disease (Second Edition), Academic Press, 23 June 2017, www.sciencedirect.com/science/article/pii/B9780128094686000164.
Burke, Susan J, et al. “Db/Db Mice Exhibit Features of Human Type 2 Diabetes That Are Not Present in Weight-Matched C57BL/6J Mice Fed a Western Diet.” Journal of Diabetes Research, Hindawi, 2017, www.ncbi.nlm.nih.gov/pmc/articles/PMC5606106/.
Kulkarni, Rohit N. “The Islet β-Cell.” The International Journal of Biochemistry & Cell Biology, Pergamon, 7 Nov. 2003, www.sciencedirect.com/science/article/abs/pii/S1357272503002917?via%3Dihub.
Matthews, DR., et al. “RNA-Seq Analysis of Islets to Characterise the Dedifferentiation in Type 2 Diabetes Model Mice Db/Db.” Endocrine Pathology, Springer US, 1 Jan. 1998, link.springer.com/article/10.1007/s12022-018-9523-x.
King, Aileen J F. “The Use of Animal Models in Diabetes Research.” British Journal of Pharmacology, Blackwell Publishing Ltd, June 2012, www.ncbi.nlm.nih.gov/pmc/articles/PMC3417415/#:~:text=Although%20rats%20and%20mice%20are,to%20develop%20obesity%20in%20captivity.
Kautzky-Willer, Alexandra, et al. “Sex and Gender Differences in Risk, Pathophysiology and Complications of Type 2 Diabetes Mellitus.” Endocrine Reviews, Endocrine Society, June 2016, www.ncbi.nlm.nih.gov/pmc/articles/PMC4890267/.
A. Mortazavi, BA. Williams, et al. “Comparison of Stranded and Non-Stranded RNA-Seq Transcriptome Profiling and Investigation of Gene Overlap.” BMC Genomics, BioMed Central, 1 Jan. 1970, bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-1876-7.
Hormone Health Network.“Monogenic Diabetes | Hormone Health Network.” Hormone.org, Endocrine Society, 5 April 2021, https://www.hormone.org/diseases-and-conditions/diabetes/monogenic-diabetes